Tan et al. Genome Biology 2011, 12:R35
http://genomebiology.com/2011/12/4/R35



Genome Biology



METHOD 



Open Access 



An optimized microarray platform for assaying 
genomic variation in Plasmodium falciparum field 
populations 

John C Tan, Becky A Miller, Asako Tan, Jigar J Patel, Ian H Cheeseman, Tim JC Anderson,
Magnus Manske, Gareth Maslen, Dominic P Kwiatkowski and Michael T Ferdig



Abstract 

We present an optimized probe design for copy number variation (CNV) and SNP genotyping in the Plasmodium
falciparum genome. We demonstrate that variable length and isothermal probes are superior to static length
probes. We show that sample preparation and hybridization conditions mitigate the effects of host DNA 
contamination in field samples. The microarray and workflow presented can be used to identify CNVs and SNPs 
with 95% accuracy in a single hybridization, in field samples containing up to 92% human DNA contamination. 



Background 

Plasmodium falciparum is the intracellular parasite 
responsible for the majority of the world's malaria mor- 
bidity and mortality burden in humans, causing an esti- 
mated 243 million episodes of malaria and 863,000 
deaths each year [1]. Efforts to control and eradicate 
malaria are hampered by the accelerated evolution of 
drug resistance in the parasite. To date, the parasite has 
developed resistance to all major antimalarial drugs, 
raising concerns about the spread of drug-resistant para- 
sites and the ability to effectively treat malaria [2]. The 
development of new technologies aimed at understand- 
ing parasite genome variability provides hope in identi- 
fying new drug targets, implementing smarter treatment 
plans, and ultimately reducing or eliminating the burden 
of malaria. 

Genome variation such as SNPs and copy number variation
(CNV) underpins P. falciparum drug resistance.
The primary determinant of chloroquine resistance is a
mutation in the P. falciparum chloroquine resistance
transporter gene on chromosome (chr) 7 [3,4]. In vitro
resistance to the antifolate drugs sulfadoxine and pyri- 
methamine increases in a step-wise manner as mutations 
accrue in dihydrofolate reductase and dihydropteroate 
synthase [5-8]. Varying copy number of the P. falciparum
multidrug resistance 1 gene on chr 5 influences parasite 
susceptibility to a range of antimalarial drugs, including 
mefloquine, lumefantrine, quinine, and artemisinin 
[9-11]. Amplification on chr 12 of GTP cyclohydrolase 1 
of the folate biosynthesis pathway is correlated with anti- 
folate drug resistance [12,13]. These examples emphasize 
the importance of genomic variation in drug resistance 
and the need to assay both SNPs and CNV genome-wide in
the malaria parasite. 

Microarrays provide a relatively fast and inexpensive 
way of examining genomic variation in P. falciparum 
[14]. Array comparative genomic hybridization (CGH) 
has been successfully used to look at structural variation 
and CNV in multiple P. falciparum strains [12,15-17],
while large-scale sequencing efforts identifying SNPs 
[18-20] have spurred the development of SNP microar- 
rays. Neafsey et al. [21] genotyped 1,638 out of 3,000 
queried SNPs with 100% accuracy using an Affymetrix
3K SNP assay. Mu et al. [22] used Affymetrix molecular 
inversion probe technology to genotype 2,763 of 3,354 
SNPs with >90% call rate. Multiple groups have success- 
fully applied CGH for SNP detection with 80 to 90% 
sensitivity to approximately 3,000 SNPs [23,24] and 
identified parameters influencing SNP detection [24,25]. 
However, the reported detection rates are based on a 
core subset of SNPs (approximately 3,000) in a genome 
with more than 100,000 cataloged SNPs (PlasmoDB
v5.5) [26]. 

Our central goal was to develop a single microarray 
platform that can assay CNV and genotype SNPs simul- 
taneously and to optimize this platform for the chal- 
lenges of monitoring monoclonal patient blood samples 
from field studies. We first empirically determined the 
optimum probe lengths and melting temperatures for 
SNP genotyping in the 81% AT P. falciparum genome. 
This was used to guide the design of a single high-reso- 
lution genotyping microarray with variable length probes 
optimized for high quality SNP genotyping and CNV 
detection (Figure 1). One half of the microarray interro- 
gates 45,524 SNP loci using optimized resequencing 
probes 29 to 41 bp in length capable of making a base 
call at a precise nucleotide position [27]. The second 
half identifies CNV using tiled CGH probes 50 to 75 bp 
in length. We determine the reliability and accuracy of 
the CNV-SNP array using the laboratory lines 3D7, 
HB3, Dd2, SC05, and 7C126. We then validate the 



utility and robustness of the microarray using field samples
with limited parasite DNA and high human DNA
contamination, obtained from patient blood collected
at the Thailand-Burma border.

Results 

Effects of probe length and probe melting temperature 
on the robustness of base calls 

Using a prototype 5K SNP chip, the performance of sta- 
tic probe lengths was compared to the performance of 
variable length, isothermal probes on base calling 
robustness. A base call is considered robust when a 
probe quartet has a single high nucleotide signal relative 
to the other three nucleotide signals, and the sense and 
antisense base calls are complementary. A Dscore calcu- 
lating the background noise relative to the highest signal 
intensity in a probe quartet (Figure 1, grey inset) was 
used to compare the performance of static probes and 
isothermal probes. A Dscore close to 1 indicates high 
background noise and poor discrimination ability 





Figure 1 Microarray layout and design. The microarray contains blocks of probesets for SNP genotyping and CGH. SNP genotyping probesets
are composed of two probe quartets, one for each strand (red blowouts). Probes from one quartet typically have hybridization signals in a
similar dynamic range. These signals are used to determine a base call and calculate base calling robustness, which is expressed as a Dscore
where the two lesser hybridization signals are used to estimate background noise for each probe quartet (grey inset). The CGH probesets record
data that are used to generate log2 ratios (cy3 versus cy5) of a test and reference sample.
between probe signals, while a score close to 0 indicates
low background noise and good discrimination ability. 
The mean Dscore of the 5K SNP chips was plotted for
static probe lengths (Figure 2a) and for probe melting
temperature (Figure 2b). A one-way ANOVA indicates
significant differences between mean Dscore at various
probe lengths (P < 0.0001), and Tukey's multiple
comparison test indicates that all probe lengths except
39- and 41-mers have a significantly different mean
Dscore (P < 0.05). Of the nine tested probe lengths,
39-mer probes generated the lowest mean Dscore and
therefore the best discrimination ability (mean = 0.3575).
A one-way ANOVA likewise indicates that mean Dscore
differs significantly among probe melting temperatures
(P < 0.0001), and Tukey's multiple comparison test
indicates that the 66°C melting temperature range was
significantly different from other melting temperatures
(P < 0.05). Across the tested range of 42 to 82°C, probes
with a 66°C melting temperature generated the lowest
mean Dscore and the best discrimination ability
(mean = 0.2647), outperforming every static probe
length. Similar performance was seen when comparing
exons, introns, and intergenic regions (Figure S1
in Additional file 1).

Microarray base calling accuracy 

Microarray data for 3D7, HB3, Dd2, SC05, and 7C126 
(n = 15, 5, 3, 2, and 2, respectively) were compared to 
genome sequence data to ascertain base calling accuracy 
(Figure 3a). A useable base call was made at a SNP 



locus when the sense and antisense probesets indicated 
complementary bases; otherwise, there would be no 
base call for that SNP locus. Figure 3a depicts mean
microarray accuracy plotted by Dscore; also depicted are
CNV-SNP array data for the SNP subsets that are
represented on SNP genotyping microarrays developed
by the Broad Institute [21] and NIH [22]. The CNV-SNP
array genotypes 1,507 of the 1,631 publicly available
Broad Institute SNPs and 2,621 of the 2,743 publicly
available NIH SNPs. For all SNP sets, a lower
Dscore is associated with higher accuracy. SNPs from the
Broad Institute and NIH maintain >95% accuracy at all
Dscore cutoffs, with 97.1% and 98.7% accuracy,
respectively, at a Dscore cutoff of 1. Accuracy across all
SNPs assayed on the CNV-SNP array remains >95%
for Dscore ≤ 0.9 but drops to an average of 94.6%
at a Dscore cutoff of 1. Figure 3b depicts the
mean number of base calls made by the microarray
plotted by Dscore for all SNPs and for the SNP subsets
from the Broad Institute and NIH microarray platforms.
On average, a microarray hybridization
yielded 36,948 base calls from the 45,524 assayed SNP
loci for a base call rate of 81.2%. The Broad Institute
and NIH SNP subsets had an average call rate of 90.7%
and 93.0%, respectively. Fewer base calls
are made at more stringent Dscore cutoffs for all data
sets. When comparing accuracy for SNP subsets in
exons, introns, and intergenic regions, introns and
intergenic regions perform similarly, and exons exhibit the
best performance (Table 1).
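The relationship between Dscore cutoff, call rate, and accuracy described above can be illustrated with a minimal Python sketch. The function name, the tuple layout, and the default cutoff are assumptions made for illustration; this is not the analysis code used in the study.

```python
# Hypothetical helper: summarize base calls at a given Dscore cutoff.
# Each call is (array_base, reference_base, dscore) for a locus where the
# sense and antisense quartets already agreed on a base.

def call_rate_and_accuracy(calls, n_loci, cutoff=1.0):
    """Return (call_rate, accuracy) for calls with dscore <= cutoff."""
    kept = [(obs, ref) for obs, ref, score in calls if score <= cutoff]
    n_called = len(kept)
    n_correct = sum(1 for obs, ref in kept if obs == ref)
    call_rate = n_called / n_loci
    accuracy = n_correct / n_called if n_called else float("nan")
    return call_rate, accuracy


# Consistency check on the averages reported in the text:
# 36,948 base calls out of 45,524 assayed loci.
reported_call_rate = round(100 * 36948 / 45524, 1)  # 81.2
```

A more stringent cutoff keeps fewer calls but tends to raise accuracy, which is the trade-off shown in Figure 3.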
Figure 2 Base calling robustness is affected by probe length and melting temperature. Mean Dscore is plotted by (a) probe length and (b)
probe melting temperature where vertical lines indicate 95% confidence intervals. Lower discrimination scores indicate greater base calling 
robustness. Fixed length 39-mers provided the best performance for any static probe length tested. Probes with a 66°C melting temperature 
provided the best performance for any melting temperature range and surpassed the performance of 39-mer probes. 
Figure 3 Microarray accuracy rate and base calling. (a) Microarray base calling accuracy rate was calculated by comparing microarray base
calls with sequence data for five parasite genomes. At more stringent discrimination scores, accuracy rate increased. SNP subsets demonstrate
greater performance than average if those SNPs are more amenable to hybridization-based interrogation (NIH and Broad subsets). (b) Mean
number of base calls produced at various Dscore cutoffs for all SNP loci and for SNP subsets from previously published microarray platforms (NIH
and Broad).



Comparative genomic hybridization performance 

Segmentation analyses on multiple hybridizations of
HB3 and Dd2 against the reference 3D7 detected the
same copy number events previously reported for
HB3 and Dd2 hybridized against 3D7 [25]. Median
probe spacing between the 5' ends of CGH probes is
52 bp, providing a fine resolution view of CNV that can
localize breakpoints to within 100 bp. The
resolution provided by this platform is equivalent to a 
previous NimbleGen CGH chip [25]. Figure 4 shows 
CGH scatterplots demonstrating microarray-based CNV 
breakpoint detection in comparison with the exact 
breakpoint determined through capillary sequencing. 

CNV events are highly reproducible between replicate 
hybridizations with this microarray platform (Figure S2 
in Additional file 1). Features, including a 500 bp CNV 
event, were recognized and consistent between hybridi- 
zations; however, it becomes more difficult to confi- 
dently detect small CNV events algorithmically. CNV 
event detection is still possible with whole genome 
amplification (WGA) samples (Figure S3 in Additional 
file 1), although the amplification process introduces 
noise, confounding CNV detection by any platform, par- 
ticularly reducing confidence in small events. Degraded 
samples cannot be recovered for effective CGH by
WGA. 

Applications to P. falciparum field samples 

Standard probe labeling protocols utilize random non- 
amers with balanced base composition. However, P. 



falciparum has an extremely high AT content of 81%
[28], which reduces the performance of 50% AT random
nonamers and may introduce bias during amplification
of the parasite genome. To test the effect of
random nonamer AT composition on labeling performance
and microarray data, labeling yields of 65% AT
random nonamers (38,357 ± 3,468.1 ng) were compared
with labeling yields of 50% AT random nonamers
(18,865 ± 4,530.7 ng). The data pass a
D'Agostino-Pearson normality test, and a paired t-test
indicates that 65% AT random nonamers generate a
significantly greater yield of labeled DNA than 50% AT
random nonamers (P < 0.01). We see no adverse
effects on base calling accuracy or CGH performance
when using this modified labeling procedure.

Larger yields of labeled DNA are generated by 65% 
AT random nonamers, allowing the amount of initial 
starting DNA to be reduced. We evaluated decreasing 
amounts of starting material that could generate the 
necessary 10 µg of labeled DNA for hybridization.
Labeling yields using 250 ng, 375 ng, 500 ng, and 1,000 
ng of starting DNA for 3D7, HB3, and Dd2 were 

Table 1 Microarray accuracy in exons, introns, and intergenic regions

Location            Accuracy, Dscore ≤ 1.0    Accuracy, Dscore ≤ 0.5
Exon                96.3%                     98.5%
Intron              91.2%                     95.4%
Intergenic region   90.4%                     94.9%

Figure 4 CNV breakpoint detection with CGH. CGH data for a CNV on chr 5 in the Dd2 genome accurately detects the breakpoints for the
(a) beginning and (b) end of the event within hundreds of base pairs. The vertical red lines indicate the breakpoint locations as previously 
determined through sequence data [43]. Nt, nucleotides. 



quantified (Figure 5a) and hybridized on the microarray. 
More than 10 µg of labeled DNA was obtained from all
labeling reactions and no base calling or CNV detection
differences were seen between the different starting
amounts, indicating a starting DNA amount ≤250 ng is
sufficient to generate high quality hybridization data.

Human DNA is invariably present in field-collected 
samples of parasite DNA and is especially high when 
leukocyte depletion is not used in the extraction 
method. In some cases, human DNA may constitute 
>90% of the total DNA extracted from infected blood 



samples and can hinder downstream uses of the parasite 
DNA for microarray hybridizations or sequencing. 
Nucleic acid blockers are commonly used in microarray 
hybridizations to prevent random probe binding to non- 
target nucleic acids and were tested on our microarray 
to prevent performance reduction in samples with sig- 
nificant amounts of human DNA contamination. We 
tested various nucleic acid blockers and found that 1×
Denhardt's solution provided the greatest number of 
base calls with no negative impact on CGH data when 
compared to hybridizations with bovine serum albumin,
Figure 5 Field sample analysis with the CNV-SNP array. (a) The manufacturer-recommended starting amount is 1,000 ng of DNA to produce
at least 10 µg of labeled product. However, 250 ng of parasite DNA consistently produced sufficient labeled product when using 65% AT
nonamers. Error bars indicate one standard deviation. (b) Hybridizations with field samples - straight from patient blood, or whole genome
amplified - produced microarray data on par with standard lab clones, even when significant human DNA contamination was present.
Microarray accuracy was determined through Illumina sequencing of lab-adapted parasites. Patient blood samples were hybridized with the
addition of 1× Denhardt's solution while WGA samples were not.
dimethyl sulfoxide, human Cot-1 DNA, salmon sperm,
and yeast tRNA. To test for the effect of human DNA 
contamination on our microarray, several DNA samples 
were extracted from Thailand-Burma patient blood sam- 
ples or were whole genome amplified from patient 
blood. The amount of human DNA present in each 
sample was quantified, and 33 to 92% of the total DNA 
was found to be human, with WGA samples containing 
the most human DNA. WGA is generally, but not 
always, helpful as there is some variability in data gener- 
ated from WGA samples. Using 250 ng of parasite 
DNA, field samples were hybridized to the microarray 
and examined for the number and accuracy of base calls 
(Figure 5b); accuracy was determined using Illumina
sequences generated from the same samples (M Man- 
ske, unpublished observations). This platform is able to 
produce high quality data with samples containing 
extensive host DNA contamination equivalent to data 
from purified lab line DNA (Figure 5b). 

Discussion 

The CNV-SNP array provides robust, accurate data for 
both laboratory- and field-derived samples. Through 
optimizations described here, the CNV-SNP array over- 
comes many hurdles associated with molecular work on 
P. falciparum field samples. Lower starting amounts of 
DNA are possible when using 65% AT random nona- 
mers that compensate for the extreme AT bias of the 
genome. This optimization is especially useful for field 
sample DNA, which is typically scarce and difficult to 
obtain. It also eliminates the need for in vitro culture 
adaptation of field samples, which is typically used to 
generate enough DNA for applications like next-genera- 
tion sequencing and is known to alter CNV and skew 
results of CNV analyses [29,30]. Using our modified 
protocol, the CNV-SNP array requires no more than 
250 ng of starting parasite DNA with no compromise in 
data quality. Moreover, the ample yields of labeled DNA 
from 250 ng of starting parasite DNA indicate that the 
lower limit has not yet been defined, raising the possibi- 
lity that finger prick blood samples on filter paper are 
accessible to this technology. In addition, the CNV-SNP 
array is robust to samples with high host DNA contami- 
nation (>90%) with no drop in data quality, making 
microarray-based genotyping complementary to higher 
resolution next-generation sequencing that is sensitive 
to human DNA contamination in field samples, often 
requiring sample preprocessing for target DNA enrich- 
ment. Notably, high human DNA contamination and 
low amounts of parasite DNA present serious challenges 
to genotyping the large number of samples necessary for 
genome-wide association studies. 

Probe design optimizations contribute to the performance
of this microarray for the P. falciparum genome.



Sense and antisense resequencing probe quartets were 
used for SNP genotyping on the CNV-SNP array. A 
SNP call required that sense and antisense probe quar- 
tets made complementary calls; furthermore, the robust- 
ness of the base call was evaluated using the ratio of 
background signal versus the probe with the greatest 
signal intensity. Signal intensities within SNP probe
quartets were more similar to each other than to probes
in other probe quartets or between sense and antisense
probe quartets of the same locus (Figure 1). This
indicates the importance of measuring the background
signal for each individual SNP quartet - as provided by the
resequencing probesets - rather than background noise
from the entire array or locus.

Resequencing probes were optimized for SNP genotyping
in P. falciparum by comparing the performance
of probes at static lengths with probes balanced by melting
temperature on a prototype 5K SNP array. Isothermal
probes outperformed static length probes, with optimal
SNP detection at a probe melting temperature of 66°C
and performance that was reasonably consistent
in exons, introns, and intergenic regions (Figure
S1 in Additional file 1).

Our results on optimal probe length and melting tem- 
perature differ from findings in another study [31]. This 
is likely due to the use of different methods for calculat- 
ing probe melting temperature and our optimization to 
the AT-rich P. falciparum genome. However, our 
broader conclusion that variable length or isothermal 
probes provide optimal SNP detection is supported 
across various organisms [31,32], and indicates that
longer, isothermal probes increase signal strength while 
also being short enough to remain sensitive to single 
base mismatches [32-35]. 

Resequencing probesets designed for a 66°C melting 
temperature were generated for 45,524 SNP loci for 
inclusion on the CNV-SNP array. While longer, isother- 
mal probes improve SNP genotyping, certain loci are 
more easily genotyped than others, and some remain 
inaccessible to microarrays and short-read next-genera- 
tion sequencing technologies. For instance, SNPs in 
exons have greater genotyping success than SNPs in 
introns or intergenic regions, likely due to regions of 
high AT richness or interspersed sequence repetitiveness 
that hinder probe design and binding specificity in 
intronic and intergenic regions. Current SNP genotyping 
microarrays, such as those developed by the NIH and 
the Broad Institute [21,22], are focused on high quality 
SNP loci that are easily genotyped across microarray 
platforms (Figure 3). However, the use of isothermal 
probes designed at an optimal melting temperature 
allows us to interrogate more difficult loci and maximize 
the overall number of SNPs that can be robustly geno- 
typed on the CNV-SNP array (on average, 36,948 
useable SNP genotypes with 95% accuracy from a single
hybridization). 

An interesting debate surrounds the continued value 
of microarrays with the emergence of next-generation 
sequencing. As the cost of next-generation sequencing 
continues to decrease and protocols continue to 
improve, we will see a realization of the platform pro- 
viding ultimate resolution and throughput, provoking 
the prediction that microarrays will soon be rendered 
obsolete. However, we suggest that the CNV-SNP array 
will continue to be useful as an 'everylab' tool alongside 
next-generation sequencing. Whole genome sequencing 
underpins the SNP discovery needed for chip design; in 
general, whole genome architecture and ultra-resolution 
mapping require fully sequenced and assembled gen- 
omes. The customizable microarray platform continues 
to improve in density (4.2 million element custom 
designs are anticipated in 2011) and offers unique con- 
figurations up to 12-plex of 135K probes, leading to a 
scenario in which a global set of SNPs identified by 
sequencing can be precisely represented on microarrays 
for regionally focused or hypothesis-driven designs. To 
date, microarrays remain cheaper, produce data more 
quickly, require less computational innovation, and are 
especially useful for processing large numbers of sam- 
ples, while producing sufficient resolution and quality 
for genome-wide association studies and population 
genomic analysis. Furthermore, although progress is 
being made in scoring CNV using next-generation 
sequencing, that technology still lags behind the perfor- 
mance of microarray CGH. 

Conclusions 

As P. falciparum continues to evolve and evade control 
and eradication efforts, high-throughput, cost-effective 
methods of monitoring genomic variation are critical to 
understanding parasite adaptation. The high AT content 
of the P. falciparum genome is technically challenging
for most molecular methods; however, the flexibility of 
the microarray platform described here allows users to 
customize and optimize microarrays to individual gen- 
omes through alterations of probe lengths, types, and 
numbers and adjustment of hybridization strategy. This 
process is applicable to population genomic studies in a 
wide range of organisms. Utilizing this flexibility, we 
created a custom high-density CNV-SNP array contain- 
ing both resequencing probes capable of SNP genotyp- 
ing and CGH probes for CNV detection. The CNV-SNP 
array is a reliable, accurate platform that allows simulta- 
neous investigation of CNV and SNPs in a single hybri- 
dization. Its low cost, quick turn-around time, low DNA 
requirements, and resilience to human DNA contamina- 
tion make it a valuable tool for population genomic 
studies. 



Materials and methods 

Microarray design 
Probe length optimization 

As an initial step in SNP genotyping optimization, a 
NimbleGen resequencing microarray consisting of vari- 
able length resequencing probesets was developed to 
assay 5,347 SNP loci. A NimbleGen resequencing probe- 
set is composed of eight probes per SNP locus: four 
probes each for interrogation of sense and antisense 
strands. Each probe quartet is identical except for the 
central nucleotide that assays the nucleotide variant 
[27]. We downloaded 101,581 candidate SNP loci from 
PlasmoDB v5.5 [26] for parasite isolates HB3, Dd2, V1/S,
7G8, D10, FCC-2, K1, RO-33, D6, GHANA1, FCB,
and IT. Common SNPs (SNPs identified in at least two 
parasite isolates) were blasted against the 3D7 genome 
to verify a unique 21-mer SNP typing probe sequence. 
Probesets with more than one exact match in the gen- 
ome were discarded. Of the remaining candidate probe- 
sets, 5,347 SNP loci were randomly chosen for inclusion 
on the 5K SNP chip at nine different probe lengths: 21, 
25, 29, 33, 35, 37, 39, 41, and 45-mers. Hybridizations 
following NimbleGen standard CGH procedures [36] 
were performed using DNA from laboratory clones 3D7, 
HB3, Dd2, 7G8, and D10.
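
The probeset structure described above (eight probes per SNP locus, two quartets that differ only at the central nucleotide) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not NimbleGen's design software: the function names are hypothetical, and a fixed odd probe length is used for simplicity, whereas the final array used variable length probes.

```python
# Hypothetical sketch of building a resequencing probeset for one SNP locus.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def probe_quartet(template, center, length):
    """Four probes of odd `length` centred on index `center`,
    identical except for the central nucleotide."""
    if length % 2 == 0:
        raise ValueError("probe length must be odd so the SNP is central")
    half = length // 2
    if center - half < 0 or center + half >= len(template):
        raise ValueError("probe would extend past the template")
    left = template[center - half:center]
    right = template[center + 1:center + half + 1]
    return {base: left + base + right for base in "ACGT"}


def probeset(template, center, length):
    """Return (sense_quartet, antisense_quartet): eight probes in total."""
    sense = probe_quartet(template, center, length)
    rc = reverse_complement(template)
    rc_center = len(template) - 1 - center  # same locus on the other strand
    antisense = probe_quartet(rc, rc_center, length)
    return sense, antisense
```

All nine probe lengths tested on the 5K SNP chip (21 to 45 nucleotides) are odd, which is what allows the interrogated base to sit exactly at the probe centre.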
CNV-SNP array 

Using NimbleGen's 3-plex custom chip layout, each 
plex in our 3-plex 720K NimbleGen microarray con- 
tains resequencing probes for SNP genotyping and 
CGH probes for CNV detection (Figure 1). Probes 
were synthesized using maskless photolithography 
[37,38] with CGH probes attached by 5T linkers and 
resequencing probes attached with 15T linkers. Of the 
101,581 candidate SNP loci downloaded from Plas- 
moDB and BLASTed for uniqueness, probesets for all 
SNPs reported in at least two parasite lines were
included in the microarray. Some SNPs represented by 
a single isolate were included on the array and priori- 
tized by mutation types sensitive to array detection 
[25]. In total, 45,524 SNP loci queried by 364,192 
probes balanced to 66°C melting temperature were 
included on the microarray. CGH microarray probes 
were designed using standard NimbleGen CGH proto- 
col [36] modified for the P. falciparum genome. 
Briefly, probes were tiled through the genome at 4-bp 
interval spacing and filtered for 60 to 80°C melting 
temperature and 50 to 75 bp length. The resulting 
probes were clustered with nearest neighbors and 
sorted to remove probes with extensive sequence iden- 
tity to any other probe. Probes with more than one 
50-mer exact match in the genome or located in 
hypervariable var/rif/stevor genes were discarded. The
median spacing between the start of the 355,803 CGH 
probes is 52 bp. 
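
The tiling and filtering step for CGH probes can be sketched as below. This is a simplified illustration, not the NimbleGen protocol: the melting temperature formula is a basic GC-content approximation chosen only for the example (the method actually used is described in [36]), keeping the shortest passing probe at each start position is an assumption, and the uniqueness, clustering, and hypervariable-gene filters are omitted.

```python
# Hypothetical sketch: tile candidate CGH probes at 4-bp intervals and keep
# those within the stated length (50 to 75 bp) and Tm (60 to 80 C) windows.

def melting_temp(seq):
    """Basic GC-content Tm approximation (an assumption for this sketch)."""
    gc = seq.count("G") + seq.count("C")
    return 64.9 + 41.0 * (gc - 16.4) / len(seq)


def tile_probes(genome, step=4, min_len=50, max_len=75,
                tm_min=60.0, tm_max=80.0):
    """Yield (start, probe) for each tiled start position, keeping the
    shortest probe whose Tm falls inside the allowed window."""
    for start in range(0, len(genome) - min_len + 1, step):
        for length in range(min_len, max_len + 1):
            probe = genome[start:start + length]
            if len(probe) < length:
                break  # ran off the end of the sequence
            if tm_min <= melting_temp(probe) <= tm_max:
                yield start, probe
                break
```

In an AT-rich sequence many start positions yield no probe that reaches the minimum melting temperature, so the retained probes end up spaced more widely than the 4-bp tiling interval. As a separate arithmetic check on the resequencing half of the array, eight probes for each of the 45,524 SNP loci gives the 364,192 probes stated above.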
Parasite samples

Fresh cultures of cloned 3D7, HB3, Dd2, SC05, and 
7C126 lines derived from genotype-confirmed stock 
material were grown under standard cultivation condi- 
tions [39,40]. Parasite DNA was extracted using stan- 
dard phenol/chloroform extraction and concentrated by 
salt precipitation [39]. Parasite DNA from the Thailand- 
Burma border was collected from 5 ml whole blood of 
symptomatic patients visiting malaria clinics. Buffy coats 
from the blood samples were removed and infected red 
blood cells cultured for 24 h to allow ring stage para- 
sites to mature to schizonts to provide more DNA. 
DNA was extracted using a standard phenol/chloroform 
protocol. Parasite isolates were screened for multiple 
infections and identical clones using seven polymorphic 
microsatellite markers: ARA2 (chr 11), POLYa (chr 4), 
TA1 (chr 6), C2M1 (chr 2), C3M54 (chr 3), TA60 (chr
13), and C4M30 (chr 4). These markers were amplified 
using fluorescent end-labeled oligos and run on an ABI 
3100 capillary sequencer (Applied Biosystems Inc., Fos- 
ter City, CA, USA) and alleles scored using GeneScan 
(Applied Biosystems Inc.) and Genotyper (Applied Bio- 
systems Inc.) software. Samples were considered multi- 
ple clone infections if one or more of the seven 
microsatellite loci showed multiple alleles. Only unique 
genotypes were included in the study. Thailand-Burma 
samples with an inadequate amount of DNA were 
whole genome amplified using phi29 DNA polymerase 
(Fidelity Systems, Gaithersburg, MD, USA). WGA sam- 
ples were cleaned using a standard phenol/chloroform 
protocol. Parasite DNA concentrations in the Thailand-Burma
samples were measured using quantitative PCR.
Patient samples were amplified using SYBR Green
(Applied Biosystems Inc.) with primers specific to the P.
falciparum ama1 gene. Reactions were run on an ABI
PRISM 7900HT real-time PCR machine and DNA
amounts calculated by comparison with a dilution series
of pure DNA from parasite line 3D7. DNA from
lab-adapted samples was submitted to the Wellcome Trust
Sanger Institute (Hinxton, UK) for Illumina sequencing
using 76-bp paired-end reads.
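
Quantification against a dilution series is commonly done by fitting a standard curve of threshold cycle (Ct) against log10 of the input amount and inverting it for unknown samples. The sketch below assumes that approach; the function names are hypothetical, and the way the human DNA fraction is derived (total DNA minus qPCR-estimated parasite DNA) is an assumption, since the text does not state how it was computed.

```python
# Hypothetical sketch of standard-curve quantification for parasite DNA.
import math


def fit_standard_curve(amounts_ng, ct_values):
    """Least-squares fit of Ct = slope * log10(amount) + intercept."""
    xs = [math.log10(a) for a in amounts_ng]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ct_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def quantify(ct, slope, intercept):
    """Invert the standard curve to estimate the DNA amount in ng."""
    return 10 ** ((ct - intercept) / slope)


def human_fraction(parasite_ng, total_ng):
    """Fraction of total DNA not accounted for by parasite DNA."""
    return 1.0 - parasite_ng / total_ng
```

With this kind of estimate, the amount of total DNA needed to supply 250 ng of parasite DNA for labeling follows directly from the parasite fraction of each sample.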

Microarray hybridizations 

Labeling and hybridization were conducted using standard
NimbleGen CGH procedures [36]. gDNA (250 ng
to 1 µg) was denatured at 98°C for 10 minutes in the
presence of 1 OD of cy3 or cy5-labeled random nonamers
at 50% AT richness or 65% AT richness (TriLink
Biotechnologies, San Diego, CA, USA). The denatured
sample was quick chilled on ice and incubated with 50
units of Klenow fragment (New England Biolabs, Ipswich,
MA, USA) and dNTP mix (6 mM each in TE
(Sigma Aldrich, St Louis, MO, USA)) for 2 h at 37°C.
Reactions were terminated with 0.5 M EDTA and
precipitated with 5 M NaCl in isopropanol. Labeled product
was resuspended in water, and 10 µg of test and
reference samples combined (6 µg for 5K SNP chip
samples), dried down, and resuspended in hybridization
buffer (Roche NimbleGen, Inc., Madison, WI, USA);
hybridizations for patient blood samples included 1×
Denhardt's solution (Sigma Aldrich) in the hybridization
buffer. The combined sample was denatured at 95°C for
5 minutes and allowed to hybridize on the array for 24
h (16 h for 5K SNP chip samples) at 42°C in a NimbleGen
hybridization system (Roche NimbleGen, Inc.).
Microarrays were washed sequentially in Wash Buffer I
(2 minutes at room temperature), Wash Buffer II (1
minute at room temperature), and Wash Buffer III (15
s at room temperature; Roche NimbleGen, Inc.) and
dried for 1 minute in a Microarray High-Speed Centrifuge
(Arrayit Corp., Sunnyvale, CA, USA). CNV-SNP
arrays were scanned at 2 µm resolution using a NimbleGen
MS 200 Microarray Scanner (Roche NimbleGen,
Inc.). 5K SNP chips were scanned at 5 µm resolution
using a GenePix Pro 4200A Scanner (Molecular Devices,
Inc., Sunnyvale, CA, USA). Microarray data are deposited
at Gene Expression Omnibus, accession number
[GEO:GSE28287].

Microarray data analysis 
SNP analysis 

Data for 3D7, HB3, Dd2, SC05, and 7C126 were
extracted from scanned images and resequencing base
reports generated using NimbleScan v2.5 (Roche NimbleGen,
Inc.). Base calls were made on resequencing
probesets when the sense and antisense probes made
complementary base calls. The discrimination score
(Dscore) for each probe quartet was calculated using
custom perl scripts as a background-corrected ratio of
the second greatest probe signal intensity in the quartet
to the greatest probe signal intensity
(Figure 1, grey inset): (Second highest signal intensity -
Background)/(Highest signal intensity - Background).
Background for each probe quartet was calculated as the
average of the third and fourth highest signal intensities
[32]. There is some noise inherent in this base calling
method, which can be mitigated by performing technical
replicates or incorporating probe replicates into the chip
design. For SNP genotyping probes, probeset melting
temperature was calculated as previously described [36]
where mean melting temperature was calculated from
all probes in each probe quartet. The accuracy of the
base calls made by the resequencing probes was calculated
as the percentage of base calls that matched the
reference genome for 3D7 or draft genome assemblies
for HB3, Dd2, SC05, and 7C126 [20,28,41]. To ascertain
the base called at a SNP locus in the genome assembly,
resequencing probes were mapped to draft genome
assemblies using blastall, requiring at least 95% of the
probe length with less than four mismatched/indel positions
excluding the central nucleotide. Illumina
sequence data were aligned to the 3D7 reference genome
with SNP-o-matic software [42] to identify SNP
locations for comparison to microarray data.
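
The discrimination score and the strand-concordance rule defined above can be written out as a short Python sketch. The original analysis used custom perl scripts that are not reproduced here, so the function names are hypothetical, and requiring both quartets to pass the same Dscore cutoff is an assumption about how the cutoff was applied.

```python
# Hypothetical sketch of the Dscore and the strand-concordant base call.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def dscore(quartet):
    """Return (called_base, Dscore) for one probe quartet.

    `quartet` maps each central nucleotide to its signal intensity.
    Background is the mean of the third and fourth highest signals.
    """
    ranked = sorted(quartet.items(), key=lambda kv: kv[1], reverse=True)
    (top_base, top), (_, second), (_, third), (_, fourth) = ranked
    background = (third + fourth) / 2.0
    if top <= background:
        return top_base, float("nan")  # no signal above background
    return top_base, (second - background) / (top - background)


def call_snp(sense, antisense, cutoff=1.0):
    """Return the sense-strand base when both quartets give complementary
    calls and pass the cutoff; otherwise return None (no base call)."""
    s_base, s_score = dscore(sense)
    a_base, a_score = dscore(antisense)
    if COMPLEMENT[a_base] != s_base:
        return None
    if not (s_score <= cutoff and a_score <= cutoff):
        return None
    return s_base
```

A score near 0 means the second strongest probe is close to background, so the top probe clearly dominates; a score near 1 means the two strongest probes are nearly indistinguishable.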
Copy number variation analysis 

Data for HB3 and Dd2 (n = 6 and n = 4, respectively) 
hybridized against 3D7 were extracted from scanned 
images and normalized using NimbleScan v2.5 (Roche 
NimbleGen, Inc.). Copy number events from the seg- 
mentation analysis in HB3 and Dd2 against the refer- 
ence 3D7 were compared to known CNV in the 
published literature [25]. 
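
For orientation, the log2 ratio underlying the CGH analysis and a very simple run-based detector of amplified regions can be sketched as follows. This is not the NimbleScan segmentation algorithm used in the study; the threshold and the minimum number of consecutive probes are hypothetical parameters chosen only for illustration.

```python
# Hypothetical sketch: log2 ratios of test versus reference intensities and
# a naive detector of runs of consecutive probes with elevated ratios.
import math


def log2_ratios(test, reference):
    """Per-probe log2(test / reference) for paired intensity lists."""
    return [math.log2(t / r) for t, r in zip(test, reference)]


def candidate_gains(positions, ratios, threshold=0.5, min_probes=5):
    """Return (first_position, last_position) for each run of at least
    `min_probes` consecutive probes with log2 ratio above `threshold`."""
    segments, run = [], []
    for pos, ratio in zip(positions, ratios):
        if ratio > threshold:
            run.append(pos)
        else:
            if len(run) >= min_probes:
                segments.append((run[0], run[-1]))
            run = []
    if len(run) >= min_probes:
        segments.append((run[0], run[-1]))
    return segments
```

With a median probe spacing of 52 bp, a run of five probes spans roughly 200 bp, which illustrates why very small events are harder to call confidently than larger ones.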

Additional material 



Abbreviations 

bp: base pair; CGH: comparative genomic hybridization; chr: chromosome;
CNV: copy number variation; SNP: single nucleotide polymorphism; WGA: 
whole genome amplification. 